3.5. Air Quality in Dar es Salaam 🇹🇿
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
# Import libraries here
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from IPython.display import VimeoVideo
from pymongo import MongoClient
from sklearn.metrics import mean_absolute_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.ar_model import AutoReg
Prepare Data
Connect
🛠️ Instruction: Locate the IP address of the machine running MongoDB and assign it to the variable host. Make sure to use a string (i.e., wrap the IP in quotes).
⚠️ Note: The IP address is dynamic; it may change every time you start the lab. Always check the current IP before proceeding.

host = "192.176.214.2"
Task 3.5.1
client = MongoClient(host=host, port=27017)
db = client["air-quality"]
dar = db['dar-es-salaam']
Explore
Task 3.5.2
sites = dar.distinct("metadata.site")
sites
[11, 23]
Task 3.5.3
result = dar.aggregate(
    [
        {
            "$group": {
                "_id": "$metadata.site",
                "count": {"$count": {}},
            }
        }
    ]
)
readings_per_site = list(result)
readings_per_site
[{'_id': 11, 'count': 173242}, {'_id': 23, 'count': 60020}]
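The `$group`/`$count` pipeline above tallies readings per site on the MongoDB server. As a purely illustrative analogy (toy data, not the actual database), the same tally can be expressed with a pandas groupby:

```python
import pandas as pd

# Toy stand-in for documents with a metadata.site field (hypothetical values)
docs = pd.DataFrame({"site": [11, 11, 23, 11, 23]})

# Equivalent of {'$group': {'_id': '$site', 'count': {'$count': {}}}}
readings_per_site = docs.groupby("site").size().to_dict()
print(readings_per_site)  # {11: 3, 23: 2}
```

The server-side pipeline avoids pulling all 230k+ documents over the wire just to count them.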
Import
Task 3.5.4
def wrangle(collection):
    results = collection.find(
        {"metadata.site": 11, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )
    # Read results into a DataFrame, using the timestamp as the index
    df = pd.DataFrame(list(results)).set_index("timestamp")
    # Localize timezone: UTC -> East Africa Time
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Dar_es_Salaam")
    # Remove outlier readings (PM2.5 >= 100)
    df = df[df["P2"] < 100]
    # Resample to hourly means and forward-fill missing hours
    y = df["P2"].resample("1H").mean().ffill()
    return y
Use your wrangle function to query the dar collection and return your cleaned results.
y = wrangle(dar)
print(type(y))
<class 'pandas.core.series.Series'>
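The cleaning steps inside wrangle can be exercised on a small synthetic series (timestamps and values invented for illustration; no database needed):

```python
import pandas as pd

# Synthetic sub-hourly readings with a gap and one outlier (250.0)
idx = pd.to_datetime(["2018-01-01 00:05", "2018-01-01 00:35", "2018-01-01 02:10"])
df = pd.DataFrame({"P2": [10.0, 250.0, 12.0]}, index=idx)

# Localize to UTC, then convert to East Africa Time (UTC+3)
df.index = df.index.tz_localize("UTC").tz_convert("Africa/Dar_es_Salaam")

# Drop the outlier, resample to hourly means, forward-fill the empty hour
y = df[df["P2"] < 100]["P2"].resample("1H").mean().ffill()
print(y)  # three hourly values: 10.0, 10.0 (filled), 12.0
```

The forward fill matters: without it, hours with no readings would be NaN and break the autoregressive models fit later.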
Explore Some More
Task 3.5.5
fig, ax = plt.subplots(figsize=(15, 6))
y.plot(ax=ax, xlabel="Date", ylabel="PM2.5 Level", title= "Dar es Salaam PM2.5 Levels" )
# use ax=ax in your plot
<Axes: title={'center': 'Dar es Salaam PM2.5 Levels'}, xlabel='Date', ylabel='PM2.5 Level'>
Task 3.5.6
fig, ax = plt.subplots(figsize=(15, 6))
y.rolling(window=168).mean().plot(
    ax=ax,
    xlabel="Date",
    ylabel="PM2.5 Level",
    title="Dar es Salaam PM2.5 Levels, 7-Day Rolling Average",
)
# use ax=ax in your plot
<Axes: title={'center': 'Dar es Salaam PM2.5 Levels, 7-Day Rolling Average'}, xlabel='Date', ylabel='PM2.5 Level'>
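The window of 168 corresponds to 7 days of hourly readings (7 × 24). A minimal sketch of how a rolling mean behaves, using a toy 4-hour window in place of the 168-hour one:

```python
import pandas as pd

# Toy hourly series; a 4-hour window stands in for the 168-hour (7-day) window
s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2018-01-01", periods=5, freq="1H"),
)
rolled = s.rolling(window=4).mean()
print(rolled)  # first 3 values are NaN, then 2.5, 3.5
```

The first window − 1 values are NaN, which is why the rolling-average curve starts 7 days into the plot above.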
Task 3.5.7
fig, ax = plt.subplots(figsize=(15, 6))
# use ax=ax in your plot
plot_acf(y, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient")
plt.title("Dar es Salaam PM2.5 Readings, ACF")
Text(0.5, 1.0, 'Dar es Salaam PM2.5 Readings, ACF')
Task 3.5.8
fig, ax = plt.subplots(figsize=(15, 6))
# Use ax=ax in your plot
plot_pacf(y, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient")
plt.title("Dar es Salaam PM2.5 Readings, PACF")
Text(0.5, 1.0, 'Dar es Salaam PM2.5 Readings, PACF')
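For intuition about what plot_acf is drawing, the sample autocorrelation at lag k can be computed by hand. A numpy-only sketch (a simplified estimator for illustration, not statsmodels' exact internals):

```python
import numpy as np

def sample_acf(x, k):
    """Sample autocorrelation of x at lag k (the bars plot_acf draws)."""
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    return np.dot(xm[: len(xm) - k], xm[k:]) / np.dot(xm, xm)

# A strongly persistent toy series (linear trend) has ACF near 1 at lag 1
x = np.arange(1.0, 101.0)
print(sample_acf(x, 0))  # always exactly 1.0 at lag 0
print(sample_acf(x, 1))  # close to 1 for a trending series
```

The PACF differs in that it removes the influence of shorter lags, which is why it is the tool for choosing the order p of an AR model.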
Split
Task 3.5.9
cutoff_test = int(len(y)*0.90)
y_train = y.iloc[:cutoff_test]
y_test = y.iloc[cutoff_test:]
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
y_train shape: (1944,)
y_test shape: (216,)
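Note that the split keeps temporal order: no shuffling, with the most recent 10% held out as the test set. A toy check of the arithmetic:

```python
# A 90/10 time-ordered split on a toy sequence of 10 observations
y_toy = list(range(10))
cutoff = int(len(y_toy) * 0.90)
train, test = y_toy[:cutoff], y_toy[cutoff:]
print(len(train), len(test))  # 9 1
print(test)                   # [9]  (the most recent observation)
```

Shuffling a time-series split would leak future information into training, so `int(len(y) * 0.90)` as a positional cutoff is the right pattern here.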
Build Model
Baseline
Task 3.5.10
y_train_mean = y_train.mean()
y_pred_baseline = [y_train_mean]* len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
print("Mean P2 Reading:", y_train_mean)
print("Baseline MAE:", mae_baseline)
Mean P2 Reading: 8.57142319061077
Baseline MAE: 4.053101181299159
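The baseline predicts the training mean for every observation, so its MAE is simply the mean absolute deviation of the training data from its own mean. A self-contained sketch on toy numbers:

```python
import numpy as np

# Baseline model: predict the training mean everywhere
y_tr = np.array([2.0, 4.0, 6.0, 8.0])
baseline_pred = np.full_like(y_tr, y_tr.mean())   # every prediction is 5.0
mae_baseline = np.mean(np.abs(y_tr - baseline_pred))
print(mae_baseline)  # 2.0, the mean absolute deviation from the mean
```

Any model worth keeping must beat this number; here the baseline MAE is about 4.05 µg/m³.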
Iterate
Task 3.5.11
Tip: In this task, you'll need to combine the model you learned about in Task 3.3.8 with the hyperparameter tuning technique you learned in Task 3.4.9.
# Create range to test different lags
p_params = range(1, 31)
# Create empty list to hold mean absolute error scores
maes = []
# Iterate through all values of p in `p_params`
for p in p_params:
    # Build model
    model = AutoReg(y_train, lags=p).fit()
    # Make predictions on training data, dropping null values caused by lag
    y_pred = model.predict().dropna()
    # Calculate mean absolute error for training data vs predictions
    mae = mean_absolute_error(y_train.iloc[p:], y_pred)
    # Append `mae` to list `maes`
    maes.append(mae)
# Put list `maes` into Series with index `p_params`
mae_series = pd.Series(maes, name="mae", index=p_params)
# Inspect head of Series
mae_series.head()
1    1.059376
2    1.045182
3    1.032489
4    1.032147
5    1.031022
Name: mae, dtype: float64
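Under the hood, AutoReg(y, lags=p) fits a linear regression of each value on its previous p values. A numpy-only sketch of the AR(1) case by ordinary least squares (the same idea, though not statsmodels' exact estimator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) process: y_t = 0.8 * y_{t-1} + noise
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + rng.normal(scale=0.1)

# Regress y_t on [1, y_{t-1}] by ordinary least squares
X = np.column_stack([np.ones(499), y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print(coef[1])  # estimated lag coefficient, close to the true 0.8
```

The hyperparameter search above repeats this fit for p = 1 through 30 and keeps the p with the lowest training MAE.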
Task 3.5.12
best_p = 26
best_model = AutoReg(y_train, lags=best_p).fit()
Task 3.5.13
y_train_resid = best_model.resid
y_train_resid.name = "residuals"
y_train_resid.head()
timestamp
2018-01-02 05:00:00+03:00   -0.412913
2018-01-02 06:00:00+03:00    1.484934
2018-01-02 07:00:00+03:00    1.672359
2018-01-02 08:00:00+03:00   -0.368030
2018-01-02 09:00:00+03:00   -0.536868
Freq: H, Name: residuals, dtype: float64
In the following task, you'll notice a small change in how plots are created compared to what you saw in the lessons.
While the lessons use the global matplotlib method like plt.plot(...), in this task, you are expected to use the object-oriented (OOP) API instead.
This means creating your plots using fig, ax = plt.subplots() and then calling plotting methods on the ax object, such as ax.plot(...), ax.hist(...), or ax.scatter(...).
If you're using pandas' or seaborn's built-in plotting methods (like df.plot() or sns.lineplot()), make sure to pass the ax=ax argument so that the plot is rendered on the correct axes.
This approach is considered best practice and will be used consistently across all graded tasks that involve matplotlib.
Task 3.5.14
# Plot histogram of residuals
fig, ax = plt.subplots()
# Use ax=ax in your plot
y_train_resid.hist(ax=ax)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Best Model, Training Residuals")
Text(0.5, 1.0, 'Best Model, Training Residuals')
Task 3.5.15
fig, ax = plt.subplots(figsize=(15, 6))
# Use ax=ax in your plot
plot_acf(y_train_resid, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient")
plt.title("Dar es Salaam, Training Residuals ACF")
Text(0.5, 1.0, 'Dar es Salaam, Training Residuals ACF')
Evaluate
Task 3.5.16
y_pred_wfv = pd.Series(dtype="float64")
history = y_train.copy()
for i in range(len(y_test)):
    # Refit the model on everything seen so far
    model = AutoReg(history, lags=26).fit()
    # Forecast one step ahead
    next_pred = model.forecast()
    y_pred_wfv = pd.concat([y_pred_wfv, next_pred])
    # Reveal the true observation before forecasting the next hour
    # (Series.append was removed in pandas 2.0; pd.concat replaces it)
    history = pd.concat([history, y_test[next_pred.index]])
y_pred_wfv.name = "prediction"
y_pred_wfv.index.name = "timestamp"
y_pred_wfv.head()
timestamp
2018-03-23 03:00:00+03:00    10.414744
2018-03-23 04:00:00+03:00     8.269589
2018-03-23 05:00:00+03:00    15.178677
2018-03-23 06:00:00+03:00    33.475398
2018-03-23 07:00:00+03:00    39.571363
Freq: H, Name: prediction, dtype: float64
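The walk-forward loop refits the model at every step, then appends the true observation before forecasting the next hour. The pattern in miniature, with a toy "predict the running mean" model standing in for AutoReg so the sketch stays self-contained:

```python
import pandas as pd

y_train_toy = pd.Series([1.0, 2.0, 3.0])
y_test_toy = pd.Series([4.0, 5.0], index=[3, 4])

preds = pd.Series(dtype="float64")
history = y_train_toy.copy()
for i in range(len(y_test_toy)):
    # "Model" = running mean of history; a real model would be refit here
    next_pred = pd.Series([history.mean()], index=[y_test_toy.index[i]])
    preds = pd.concat([preds, next_pred])
    # Reveal the true observation before forecasting the next step
    history = pd.concat([history, y_test_toy.iloc[[i]]])
print(preds)  # 2.0 (mean of 1,2,3), then 2.5 (mean of 1,2,3,4)
```

This mimics production use, where each new hour's reading arrives before the next forecast is made; the cost is refitting once per test observation.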
Task 3.5.17
# Enter y_pred_wfv at ... (Ellipsis) to see the test mean absolute error
test_mae = mean_absolute_error(y_test, y_pred_wfv)
print("Test MAE (walk forward validation):", round(test_mae, 2))
Test MAE (walk forward validation): 3.97
Communicate Results
Task 3.5.18
df_pred_test = pd.DataFrame(
    {"y_test": y_test, "y_pred_wfv": y_pred_wfv}, index=y_test.index
)
fig = px.line(
    df_pred_test,
    x=df_pred_test.index,
    y=["y_test", "y_pred_wfv"],
    labels={"x": "Date", "value": "PM2.5 Level", "variable": "Legend"},
    title="Dar es Salaam, WFV Predictions",
)
fig.show()
Copyright 2024 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.